Feat: sebulba rec ippo #1142
base: develop
Conversation
Overall the system looks correct and reasonable. Well done Simon! I just left a few minor requests :)
@@ -0,0 +1,910 @@
# Copyright 2022 InstaDeep Ltd. All rights reserved.
If you can, update the typings in Pipeline in mava/utils/sebulba.py to be Union[PPOTransition, RNNPPOTransition].
This causes errors in the pre-commit. For now I changed both sebulba systems to use the MavaTransition
type-var but this is probably a temporary solution.
Can you please make an issue for this? I think the best solution is to make a protocol with all the common things in a transition (actions, obs, done, reward). The challenge is that named tuples don't seem to work with protocols, so we'd likely need to switch to a flax/chex dataclass.
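For illustration, a rough sketch of what that could look like (class names, field names, and field types here are assumptions for the example, not the PR's actual definitions):

```python
# Sketch only: a Protocol over the fields common to PPO-style transitions,
# with a chex dataclass as a concrete type (since NamedTuples don't seem to
# satisfy Protocols, as noted above).
from typing import Protocol, runtime_checkable

import chex


@runtime_checkable
class TransitionLike(Protocol):
    """Structural type exposing the fields shared by all transitions."""

    action: chex.Array
    obs: chex.ArrayTree
    done: chex.Array
    reward: chex.Array


@chex.dataclass(frozen=True)
class ExamplePPOTransition:
    """Hypothetical transition reworked as a dataclass instead of a NamedTuple."""

    action: chex.Array
    obs: chex.ArrayTree
    done: chex.Array
    reward: chex.Array
    value: chex.Array
    log_prob: chex.Array


def mean_reward(transition: TransitionLike) -> chex.Array:
    # Works for any transition type that exposes the shared fields.
    return transition.reward.mean()
```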
log_prob = actor_policy.log_prob(action)
# It may be faster to calculate the values in the learner as
# then we won't need to pass critic params to actors.
# value = critic_apply_fn(params.critic_params, observation).squeeze()
If you can, remove this comment.
Looks great! Pretty much good to go, except for a few minor style changes to bring it up to date with the latest PPO changes that went in at the end of last year.
def _update_minibatch(train_state: Tuple, batch_info: Tuple) -> Tuple:
    """Update the network for a single minibatch."""
    # UNPACK TRAIN STATE AND BATCH INFO
In line with the PPO updates I did last year:

Suggested change:
- # UNPACK TRAIN STATE AND BATCH INFO
    key: chex.PRNGKey,
) -> Tuple:
    """Calculate the actor loss."""
    # RERUN NETWORK
Suggested change:
- # RERUN NETWORK
+ # Rerun network
    targets: chex.Array,
) -> Tuple:
    """Calculate the critic loss."""
    # RERUN NETWORK
Suggested change:
- # RERUN NETWORK
+ # Rerun network
    critic_params, traj_batch.hstates.critic_hidden_state[0], obs_and_done
)

# CALCULATE VALUE LOSS
Suggested change:
- # CALCULATE VALUE LOSS
+ # Calculate value loss
loss_actor1 = ratio * gae
loss_actor2 = (
    jnp.clip(
        ratio,
        1.0 - config.system.clip_eps,
        1.0 + config.system.clip_eps,
    )
    * gae
)
loss_actor = -jnp.minimum(loss_actor1, loss_actor2)
loss_actor = loss_actor.mean()
# The seed will be used in the TanhTransformedDistribution:
entropy = actor_policy.entropy(seed=key).mean()

total_loss = loss_actor - config.system.ent_coef * entropy
return total_loss, (loss_actor, entropy)
Suggested change:
- loss_actor1 = ratio * gae
- loss_actor2 = (
-     jnp.clip(
-         ratio,
-         1.0 - config.system.clip_eps,
-         1.0 + config.system.clip_eps,
-     )
-     * gae
- )
- loss_actor = -jnp.minimum(loss_actor1, loss_actor2)
- loss_actor = loss_actor.mean()
- # The seed will be used in the TanhTransformedDistribution:
- entropy = actor_policy.entropy(seed=key).mean()
- total_loss = loss_actor - config.system.ent_coef * entropy
- return total_loss, (loss_actor, entropy)
+ actor_loss1 = ratio * gae
+ actor_loss2 = (
+     jnp.clip(
+         ratio,
+         1.0 - config.system.clip_eps,
+         1.0 + config.system.clip_eps,
+     )
+     * gae
+ )
+ actor_loss = -jnp.minimum(actor_loss1, actor_loss2)
+ actor_loss = actor_loss.mean()
+ # The seed will be used in the TanhTransformedDistribution:
+ entropy = actor_policy.entropy(seed=key).mean()
+ total_loss = actor_loss - config.system.ent_coef * entropy
+ return total_loss, (actor_loss, entropy)
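For reference, both versions compute the same clipped PPO surrogate objective with an entropy bonus; only the variable names differ:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}\big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\big] \;-\; c_{\mathrm{ent}}\,\mathbb{E}\big[\mathcal{H}[\pi_\theta]\big]
$$

where $r_t(\theta)$ is the importance ratio, $\hat{A}_t$ the GAE advantage, $\epsilon$ is config.system.clip_eps, and $c_{\mathrm{ent}}$ is config.system.ent_coef.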
# Calculate critic loss
critic_grad_fn = jax.value_and_grad(_critic_loss_fn, has_aux=True)
critic_loss_info, critic_grads = critic_grad_fn(
Suggested change:
- critic_loss_info, critic_grads = critic_grad_fn(
+ value_loss_info, critic_grads = critic_grad_fn(
critic_grads, critic_loss_info = jax.lax.pmean(
    (critic_grads, critic_loss_info), axis_name="learner_devices"
)
Suggested change:
- critic_grads, critic_loss_info = jax.lax.pmean(
-     (critic_grads, critic_loss_info), axis_name="learner_devices"
- )
+ critic_grads, value_loss_info = jax.lax.pmean(
+     (critic_grads, value_loss_info), axis_name="learner_devices"
+ )
actor_total_loss, (actor_loss, entropy) = actor_loss_info
critic_total_loss, (value_loss) = critic_loss_info
total_loss = critic_total_loss + actor_total_loss
loss_info = {
    "total_loss": total_loss,
    "value_loss": value_loss,
    "actor_loss": actor_loss,
    "entropy": entropy,
}
Suggested change:
- actor_total_loss, (actor_loss, entropy) = actor_loss_info
- critic_total_loss, (value_loss) = critic_loss_info
- total_loss = critic_total_loss + actor_total_loss
- loss_info = {
-     "total_loss": total_loss,
-     "value_loss": value_loss,
-     "actor_loss": actor_loss,
-     "entropy": entropy,
- }
+ actor_loss, (_, entropy) = actor_loss_info
+ value_loss, (unscaled_value_loss) = value_loss_info
+ total_loss = actor_loss + value_loss
+ loss_info = {
+     "total_loss": total_loss,
+     "value_loss": unscaled_value_loss,
+     "actor_loss": actor_loss,
+     "entropy": entropy,
+ }
batch = tree.map(
    lambda x: x.reshape(
        config.system.recurrent_chunk_size,
        num_learner_envs * num_recurrent_chunks,
Can you extract this into a variable called batch_size? It will make it clearer, and it's also used below.
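For example, a minimal sketch of the suggested extraction (the trailing reshape dimensions and the `traj_batch` name are assumptions based on the surrounding diff, not taken verbatim from the PR):

```python
# Hypothetical refactor: name the flattened batch dimension once and reuse it below.
batch_size = num_learner_envs * num_recurrent_chunks
batch = tree.map(
    lambda x: x.reshape(
        config.system.recurrent_chunk_size,
        batch_size,
        *x.shape[2:],  # assumption: keep any remaining dimensions unchanged
    ),
    traj_batch,  # assumed name of the trajectory batch being reshaped
)
```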
# batch_size = config.system.rollout_length * num_learner_envs
# permutation = jax.random.permutation(shuffle_key, batch_size)
Suggested change:
- # batch_size = config.system.rollout_length * num_learner_envs
- # permutation = jax.random.permutation(shuffle_key, batch_size)
Sebulba implementation of recurrent IPPO.